R case study: web scraping

John R Little

Duke University

Duke University: Land Acknowledgement

I want to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno, and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.

Demonstration Goals

  • Building on earlier Rfun workshops
  • Web scraping is fundamentally a deconstruction process
  • Introduce just enough HTML/CSS
  • Introduce the rvest package for harvesting websites/HTML
  • Tidyverse iteration with purrr::map()
  • Point out useful documentation & resources

This is a demonstration of leveraging the Tidyverse; it is not a research design or HTML design class. Your mileage may vary: data gathering and cleaning are vital and can be complex.

Caveats

  • Your success depends on how consistently the web author(s) structured the site
  • Read and follow the Terms of Use for any target web host
  • Read and honor the host’s robots.txt | https://www.robotstxt.org
  • Always pause between requests to avoid the appearance of a Denial of Service (DoS) attack
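Both courtesies can be scripted before any scraping begins. A minimal sketch, assuming the robotstxt package is installed; the URL is a placeholder, not a real target:

```r
library(robotstxt)

# Ask whether the host's robots.txt permits fetching this path
# (placeholder URL for illustration)
paths_allowed("https://example.com/some/path")

# A polite pause between any two requests
Sys.sleep(2)
```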

Scraping


Step one: Gathering

Ingest web page data for analysis



rvest::read_html()
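A single page can be ingested in one call. A sketch with a placeholder URL:

```r
library(rvest)

# Ingest the raw HTML of one target page (placeholder URL)
page <- read_html("https://example.com")
page  # an xml_document, ready for parsing
```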

Step two: Crawling

Systematically iterate through a website, gathering data from more than one page (URL)


purrr::map()
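Crawling with purrr::map() looks like the sketch below; the URLs are placeholders, and the pause keeps the crawl polite:

```r
library(rvest)
library(purrr)

# Placeholder URLs standing in for a real site's pages
urls <- c("https://example.com/page/1",
          "https://example.com/page/2")

# Visit each URL in turn, pausing between requests
pages <- map(urls, function(url) {
  Sys.sleep(2)      # pause to stay polite
  read_html(url)    # gather one page
})
```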

Step three: Parsing

Separating the syntactic elements of a web page into meaningful data


rvest::html_nodes()
rvest::html_text()
rvest::html_attr()
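These three functions work together: select nodes, then extract their text or attributes. A self-contained sketch using an inline HTML fragment in place of a fetched page:

```r
library(rvest)

# An inline HTML fragment stands in for a fetched page
page <- read_html('<p>My first paragraph contains a
  <a href="https://www.w3schools.com">link</a></p>')

links <- html_nodes(page, "a")   # select all <a> elements
html_text(links)                 # "link"
html_attr(links, "href")         # "https://www.w3schools.com"
```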

HTML

HyperText Markup Language

<html>
  <body>
  
    <h1>My First Heading</h1>
    <p>My first paragraph contains a 
    <a href="https://www.w3schools.com">link</a> to
    W3schools.com
    </p>
  
  </body>
</html>

HTML + CSS

Cascading Style Sheets

<html>
<body>

  <div class="abc"> ... </div>
  
  <div id="xyz"> 
    <span class="foo"> ... </span>
  </div>
  
  <span id="bar"> ... </span>

</body>
</html>


for example: https://www.vondel.humanities.uva.nl/style.css
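Those class and id attributes are exactly what rvest's CSS selectors target: .abc matches class="abc", and #xyz matches id="xyz". A self-contained sketch:

```r
library(rvest)

page <- read_html('
<html><body>
  <div class="abc">class abc</div>
  <div id="xyz"><span class="foo">span foo</span></div>
  <span id="bar">span bar</span>
</body></html>')

html_text(html_nodes(page, ".abc"))       # "class abc"
html_text(html_nodes(page, "#xyz .foo"))  # "span foo"
html_text(html_nodes(page, "#bar"))       # "span bar"
```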

Procedure

The basic workflow of web scraping is

  1. Development

    • Import raw HTML of a single target page (page detail: a leaf or node)
    • Parse the HTML of the test page and gather specific data
    • Check robots.txt and Terms Of Use (TOU)
    • In a web browser, manually browse and understand the target site’s navigation (site navigation: branches)
    • Parse the site navigation and develop an iteration plan
    • Iterate: orchestrate/automate page crawling
    • Perform a dry run with a limited subset of the target web site
    • Construct pauses: avoid the appearance of a Denial of Service (DoS) attack
  2. Production

    • Iterate/Crawl the site (navigation: branches)
    • Parse HTML for each target page (pages: leaves or nodes)
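The workflow above can be sketched end to end. Everything specific here is an assumption for illustration: the URL, the a.detail selector, and the h1 field stand in for whatever the real target site uses:

```r
library(rvest)
library(purrr)

base_url <- "https://example.com"   # placeholder target site

# Development: parse one detail page (a leaf/node)
parse_page <- function(url) {
  page <- read_html(url)
  data.frame(
    title = html_text(html_nodes(page, "h1")),  # assumed selector
    url   = url
  )
}

# Production: parse the navigation (branches), then crawl the leaves
nav   <- read_html(base_url)
links <- html_attr(html_nodes(nav, "a.detail"), "href")  # assumed selector

results <- map_dfr(links, function(u) {
  Sys.sleep(2)     # pause between requests
  parse_page(u)
})
```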

Site tree